Weigh your words - memory-based lemmatization for Middle Dutch

Authors

  • Mike Kestemont
  • Walter Daelemans
  • Guy De Pauw
Abstract

This article deals with the lemmatization of Middle Dutch literature. This text collection, like any other medieval corpus, is characterized by enormous spelling variation, which makes it difficult to perform a computational analysis of this kind of data. Lemmatization is therefore an essential preprocessing step in many applications, since it allows abstraction from superficial textual variation, for instance in spelling. The data we work with is the Corpus-Gysseling, which contains all surviving Middle Dutch literary manuscripts dated before 1300 AD. In this article we present a language-independent system that can ‘learn’ intra-lemma spelling variation. We describe a series of experiments with this system, using Memory-Based Machine Learning, and propose two solutions for the lemmatization of our data: the first procedure attempts to generate new spelling variants, while the second seeks to implement a novel string distance metric to better detect spelling variants. The latter system attempts to rerank candidates suggested by a classic Levenshtein distance, leading to a substantial gain in lemmatization accuracy. This result is encouraging and represents a substantial step forward in the computational study of Middle Dutch literature. Because of their language-independent nature, our techniques might be of interest to other research domains as well.

1 Spelling Variation in Middle Dutch

Middle Dutch is a typical example of a historical language displaying a considerable amount of spelling variation (Van der Voort van der Kleij, 2005; Ernst-Gerlach and Fuhr, 2006; Kestemont and Van Dalen-Oskam, 2009; Souvay and Pierrel, 2009). Especially before the advent of the printing press, there existed no standard language variety of Dutch, let alone a standard spelling.
As such, medieval Dutch spelling was generally highly phonological and ‘personal’ in nature, since it would represent each writer’s own dialectal pronunciation and local spelling habits. That is why even highly frequent words could be spelled in very different ways, reflecting the abundant variety of dialects and local substandards then found in the Low Countries (Fig. 1). This spelling variation makes it difficult to process medieval texts in any computational application. For instance, for authorship attribution, it ...

Correspondence: Mike Kestemont, Universiteit Antwerpen, Stadscampus, Prinsstraat 13, Room D.118, 2000 Antwerpen, Belgium. E-mail: [email protected]

Literary and Linguistic Computing, Vol. 25, No. 3, 2010. © The Author 2010. Published by Oxford University Press on behalf of ALLC and ACH. All rights reserved. For Permissions, please email: [email protected]. doi:10.1093/llc/fqq011. Advance Access published on 4 August 2010.
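
The abstract mentions candidates "suggested by a classic Levenshtein distance". As a rough illustration of that candidate step only, the sketch below computes the edit distance between an unseen spelling and the known spellings in a lexicon and returns the lemmas of the closest matches. It is not the authors' system: the memory-based learner and the novel reranking metric are not reproduced here, the function names are hypothetical, and the toy lexicon (spellings and lemma labels alike) is invented for illustration rather than taken from the Corpus-Gysseling.

```python
from typing import Dict, List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insertions, deletions and substitutions cost 1."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def candidate_lemmas(token: str, lexicon: Dict[str, str],
                     k: int = 3) -> List[Tuple[str, str, int]]:
    """Return up to k (lemma, known_spelling, distance) triples, closest first."""
    scored = sorted(
        (levenshtein(token, form), form, lemma)
        for form, lemma in lexicon.items()
    )
    return [(lemma, form, dist) for dist, form, lemma in scored[:k]]


# Hypothetical toy lexicon mapping attested spellings to lemma labels.
# All entries are assumptions made up for this example.
TOY_LEXICON = {
    "coninc": "koning",
    "coninck": "koning",
    "conincinne": "koningin",
    "ridder": "ridder",
    "riddere": "ridder",
}

if __name__ == "__main__":
    for lemma, form, dist in candidate_lemmas("conync", TOY_LEXICON):
        print(f"{lemma}\t(via {form}, distance {dist})")
```

A reranking stage such as the one the abstract describes would take this candidate list as input and reorder it with a different, learned similarity measure; plain Levenshtein distance treats every character substitution as equally costly, which is exactly the limitation such a stage is meant to address.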

Similar resources

Lemmatization for variation-rich languages using deep learning

In this article, we describe a novel approach to sequence tagging for languages that are rich in (e.g. orthographic) surface variation. We focus on lemmatization, a basic step in many processing pipelines in the Digital Humanities. While this task has long been considered solved for modern languages such as English, there exist many (e.g. historic) languages for which the problem is harder to s...

Using the Lemmatization Technique for Phonetic Transcription in Text-to-Speech System

This paper deals with a lemmatization technique and its use for the phonetic transcription of exceptional words. The lemmatizer is based on language morphology and uses a lexicon of basic word forms and the inversion of derivation rules to acquire the lemmatization rules, which are essential for finding the word bases. We have described the lemmatization algorithm and necessary modifications o...

LGeRM: lemmatization of Middle French words

Unlike most modern languages, Middle French is a language whose spelling is not yet stabilized. There is a great deal of variation in the spelling of a word and accordingly the traditional methods for lemmatization cannot be used. LGeRM (lemmes, graphies et règles morphologiques) proposes a solution based on a databank containing known lemmatized spellings and a set of graphical and morphologic...

Automatic training of lemmatization rules that handle morphological changes in pre-, in- and suffixes alike

We propose a method to automatically train lemmatization rules that handle prefix, infix and suffix changes to generate the lemma from the full form of a word. We explain how the lemmatization rules are created and how the lemmatizer works. We trained this lemmatizer on Danish, Dutch, English, German, Greek, Icelandic, Norwegian, Polish, Slovene and Swedish full form-lemma pairs respectively. W...

Journal title:
  • LLC

Volume 25  Issue 

Pages  -

Publication date 2010